YANG, MOU, ZHANG ET AL.: FACE ALIGNMENT ASSISTED BY HEAD POSE ESTIMATION


Face Alignment Assisted by Head Pose
Estimation


Heng Yang¹
heng.yang@cl.cam.ac.uk

Wenxuan Mou²
w.mou@qmul.ac.uk

Yichi Zhang³
yichizhang@fas.harvard.edu

Ioannis Patras²
i.patras@qmul.ac.uk

Hatice Gunes²
h.gunes@qmul.ac.uk

Peter Robinson¹
peter.robinson@cl.cam.ac.uk


¹ Computer Laboratory
University of Cambridge
Cambridge, UK

² School of EECS
Queen Mary University of London
London, UK

³ Faculty of Arts & Sciences
Harvard University
Cambridge, MA, US





Abstract 

In this paper we propose a supervised initialisation scheme for cascaded face alignment
based on explicit head pose estimation. We first investigate the failure cases of most
state of the art face alignment approaches and observe that these failures often share
one common global property, i.e. the head pose variation is usually large. Inspired by
this, we propose a deep convolutional network model for reliable and accurate head pose
estimation. Instead of using a mean face shape, or randomly selected shapes, for cascaded
face alignment initialisation, we propose two schemes for generating the initialisation:
the first relies on projecting a mean 3D face shape (represented by 3D facial landmarks)
onto the 2D image under the estimated head pose; the second searches for nearest
neighbour shapes in a training set according to head pose distance. By doing so, the
initialisation gets closer to the actual shape, which increases the likelihood of
convergence and in turn improves the face alignment performance. We demonstrate the
proposed method on the benchmark 300W dataset and show very competitive performance in
both head pose estimation and face alignment.


1 Introduction 

Both head pose estimation and face alignment have been well studied in recent years given
their wide application in human computer interaction, avatar animation, and face
recognition/verification. The two problems are strongly correlated, and addressing them
together enables mutual benefits. Head pose estimation from 2D images remains a
challenging problem due to the high diversity of face images [O, HE]. Recent methods [HI]
attempt to estimate the head pose by using depth data. In contrast, face alignment has
made significant progress and several methods [□, ED, ED, EB] have reported good
performance on images in the wild. However, they also show some failures. When we look
into their failure cases, we find that those samples share one significant property,
i.e., the head (face) in such images is usually rotated by a large angle from the frontal
pose.

Figure 1: Our proposed head pose based cascaded face alignment procedure (path in cyan
color) vs. conventional cascaded face alignment procedure (path in red color).

The best performing face alignment methods proposed in recent years ([ED], [□] and [EB])
also share a similar cascaded pose regression framework, i.e., face alignment starts from
a raw shape (a vector representation of the landmark locations) and updates the shape in
a coarse to fine manner. The methods in this framework are usually initialisation
dependent. Therefore, the final output of a cascaded face alignment system might change
if a different initialisation is provided for the same input image. Moreover, each model
has a convergence radius, i.e., if the initialisation lies within this radius of the
actual shape, the model will be able to output a reasonable alignment result; otherwise
it might lead the shape to a wrong location, as shown in Fig. 1. Methods like [□, ED]
initialise with a mean shape placed within the face bounding box or with a randomly
selected shape from the training set. There is no guarantee that the initialisation lies
within the convergence radius, especially when the head pose variation is large.

In this paper, we aim to address the problems discussed above and make cascaded face
alignment perform better under large head pose variations. The difference between our
proposed method and the conventional cascaded procedure is illustrated in Fig. 1.
In contrast to other methods, which use a mean shape or random shapes for initialisation,
our proposed method aims to produce better initialisations for cascaded face alignment
based on explicit head pose estimation. This is motivated by two facts: 1) most current
methods fail on face images with large head pose variation, as we will demonstrate later;
2) most recent face alignment methods work in a cascaded fashion and perform
initialisation with a mean shape. More specifically, we first estimate the head pose using
a deep Convolutional Network (ConvNet) directly from the face image. Given the estimated
head pose, we propose two schemes for producing the initialisations. The first scheme
projects a canonical 3D face shape under the estimated head pose to the detected face
bounding box. The second scheme searches the training set for initialisation shape(s)
using the nearest neighbour method in the head pose space. We build our proposed scheme
on Robust Cascaded Pose Regression (RCPR) to demonstrate the effectiveness of supervised
initialisation. We note that the proposed initialisation scheme can be naturally applied
to any other cascaded face alignment method.
In summary, we make the following contributions:

• We investigate the failure cases of several state of the art face alignment approaches
and find that head pose variation is a common issue across those methods.

• Based on the above observation, we propose a ConvNet framework for explicit head
pose estimation. It achieves an absolute mean error of about 4° in head pose
estimation for face images acquired in unconstrained environments.

• We propose two initialisation schemes based on reliable head pose estimation. They
enable a face alignment method (RCPR) to perform better and reduce large head pose
failures by nearly 50% when using only one initialisation.

To summarise, we propose better initialisation schemes based on explicit head pose
estimation for cascaded face alignment, to improve the performance, especially in the
case of large head pose variation.


2 Related Work 

Face alignment has made considerable progress in the past years and a large number of
methods have been proposed. There are two different sources of information typically used
for face alignment: face appearance (i.e., texture of the face image) and the shape
information. Based on how the spatial shape information is used, the methods are usually
categorized into local-based methods and holistic-based methods. The methods in the
former category usually rely on discriminative local detection and use explicit
deformable shape models to regularize the local outputs, while the methods in the latter
category directly regress the shape (the representation of the facial landmarks) in a
holistic way, i.e. the shape and appearance are modelled together.

2.1 Local-based methods 

Local based methods usually consist of two parts. One is local facial feature detection,
whose detectors are also called local experts, and the other is a spatial shape model.
The former describes what the image around each facial landmark looks like in terms of
local intensity or color patterns, while the latter describes how the face shape, that is
the relative location of the face parts, varies. The shape model captures variations such
as a wide forehead, narrow eyes, a long nose, etc.

There are three types of local feature detection. (1) Classification methods include
Support Vector Machine (SVM) classifiers [i, O] based on various image features such as
Gabor [EE], SIFT [O, ED], HOG [El] and multichannel correlation filter responses [O]. (2)
Regression-based approaches are also widely used. For instance, Support Vector Regressors
(SVRs) are used in [\5B] with a probabilistic MRF-based shape model, and Continuous
Conditional Neural Fields (CCNF) are used in [□]. (3) Voting-based approaches have also
been introduced in recent years, including regression forests based voting methods
[0, S, E3] and exemplar based voting methods [IZ3, IZ3].

One typical shape model is the Constrained Local Model (CLM) [□]. The CLM steps can
be summarised as follows: first, sample a region from the image around the current
estimate and project it into a reference frame; second, for each point, generate a
"response image" giving a cost for having the point at each pixel; third, search for a
combination of points which optimises the total cost, by manipulating the statistical
shape model parameters. The methods built on CLM mainly differ from each other in terms
of local experts, for instance CCNF in [□] and the Discriminative Response Map Fitting
(DRMF) in [ffl]. There are many other local based methods, using either CLM or other
models such as RANSAC in [i], graph matching in [EE], the Gauss-Newton Deformable Part
Model (GNDPM) [ES] and mixture of trees [E3].

2.2 Holistic-based methods 


Table 1: Holistic methods and their properties.

Method        Initialisation   Features          Regressor
SDM [ED]      mean pose        SIFT              linear regression
RCPR [□]      random           pixel             random ferns
IFA [□]       mean pose        HOG               linear regression
LBF [ED]      mean pose        pixel             random forests
CFAN [ED]     supervised       auto-encoder      linear regression
TCDCN [E3]    supervised       ConvNet feature   ConvNet


Holistic methods have gained high popularity in recent years and most of them work in
a cascaded way, like SDM [ED] and RCPR [Q]. We list very recent holistic methods as well
as their properties in Table 1. The methods following the cascaded framework differ from
each other mainly in three aspects: first, how the initial shape is set up; second, how
the shape-indexed features are calculated; third, what type of regressor is applied at
each iteration. For initialisation, three main strategies have been proposed in the
literature: random, mean pose, and supervised. In order to make the result less sensitive
to initialisation, previous approaches such as [□, m] propose to run multiple different
initialisations and pick the median of all the predictions as the final output. Each
initialisation is treated independently until the output is calculated. However, such a
strategy has several issues: first, the theoretical support for selecting the median
value is not well understood; second, there is no guidance on how to choose the multiple
initialisations; third, using multiple initialisations is computationally expensive. A
similar supervised initialisation scheme was proposed in [E3], where the initialisation
shapes were selected by using an additional regression forest model for sparse facial
landmark estimation. A recent work [E3] proposed a re-initialisation scheme based
on mirrorability to improve the face alignment performance.
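
The multiple-initialisation strategy discussed above can be sketched in a few lines. This
is only an illustration, assuming each run returns a flat list of landmark coordinates and
that the median is taken independently per coordinate; the function name `median_shape`
and the toy data are hypothetical and not part of any cited implementation.

```python
from statistics import median


def median_shape(predictions):
    """Combine shapes predicted from several initialisations.

    predictions: list of shapes, each a flat list [x1, y1, ..., xK, yK].
    Returns the coordinate-wise median shape (assumed interpretation of
    "median of all the predictions").
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all shapes must have the same length")
    # zip(*predictions) groups the i-th coordinate of every run together.
    return [median(values) for values in zip(*predictions)]


# Three runs on the same image; the third drifted on the first coordinate.
runs = [
    [10.0, 20.0, 30.0, 40.0],
    [11.0, 19.0, 31.0, 41.0],
    [50.0, 21.0, 29.0, 39.0],
]
combined = median_shape(runs)
print(combined)
```

The median suppresses the single drifting run, but each additional run costs a full pass
through the cascade, which is the computational issue noted above.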


3 Data preparation 

In this section we describe how the data is prepared in order to support our further
discussion. More specifically, we discuss how we provide ground truth head pose and face
bounding boxes from different face detectors for the benchmark dataset.

We use face image data from the benchmark face alignment in the wild dataset, 300W
[ED]. Since its testing samples are not publicly available, we follow the partition of
recent methods [ED] to set up the experiments. More specifically, we use face images from
AFW [ED], HELEN [E3], LFPW [i] and iBug [EZD], which include 3148 training images and 689
test images in total. The 3148 training images come from AFW (337 images), the HELEN
training set (2000 images) and the LFPW training set (811 images), and the 689 test
images come from the HELEN test set (330 images), the LFPW test set (224 images) and
iBug (135 images).
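
As a quick consistency check, the subset sizes quoted above can be summed to confirm the
stated totals. The dictionary keys are only labels chosen for this sketch.

```python
# Subset sizes as stated in the text.
train_subsets = {"AFW": 337, "HELEN train": 2000, "LFPW train": 811}
test_subsets = {"HELEN test": 330, "LFPW test": 224, "iBug": 135}

n_train = sum(train_subsets.values())
n_test = sum(test_subsets.values())
print(n_train, n_test)  # 3148 689
```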

It is intractable to get the ground truth 3D head pose for face images collected in
unconstrained conditions. In order to generate reasonable head pose (Pitch, Yaw and Roll)
values, we use the pose estimator provided by the Supervised Descent Method (SDM) [ED].
Note that, when calculating the head pose, we feed the ground truth facial landmark
locations instead of the detected landmarks. Technically, head pose is estimated by
solving the projection function from an average 3D face model (49 3D points) to the input
image, given the 3D to 2D correspondences. We also use the 3D head pose estimator provided
by [ffl] to calculate head pose for evaluating the results. It produces very similar
results to [ED]. We calculate the head pose for all images in 300W.

Figure 2: Distribution of the most erroneous samples.

The benchmark dataset only provides two types of face bounding boxes: one is the
ground truth bounding box calculated as the tight box of the annotated facial landmarks;
the other is the detection result from the model of [mi], which is quite similar to the
ground truth face bounding box. However, several models like SDM [ED] and RCPR [□] are
trained with different face bounding boxes, so their performance deteriorates
significantly when using the provided face bounding boxes. We therefore provide different
face bounding boxes for the test images by employing the Viola-Jones detector [123] and
the HeadHunter detector [O] for fair comparison. For the input images on which the face
detector fails, we manually set reasonable bounding boxes.


4 Method 

4.1 Motivation 

We first run several state of the art methods, including 6 holistic based methods (SDM
[ED], IFA [□], LBF [ED], CFAN [ES], TCDCN [E3], RCPR [□]) and 3 local based methods
(GNDPM [123], DRMF [ffl], CCNF [□]), chosen for their good performance and the
availability of source code. For each method, we provide the best type of face bounding
boxes in order to get the best performance. For each method, we select the 50 most
difficult samples out of the 689 test samples, i.e. those with the biggest sample-wise
alignment error. Then we plot their head poses in Fig. 2 (left). As can be seen, most of
the points are far from the origin, i.e. they have big rotation angle(s). We further plot
the histogram of the biggest absolute rotation angles of those samples in Fig. 2 (right).
The biggest absolute rotation angle of a sample is the one of its three rotation angles
with the biggest absolute value. As can be seen, those samples are concentrated at big
absolute angles, and very few of them have small rotation angles. Based on this
observation, we conclude that large head pose rotation is one of the main factors that
make most current face alignment methods fail. Based on this fact, we develop a head pose
based initialisation scheme for improving the performance of face alignment under large
head pose variations.
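
The failure analysis above amounts to ranking test samples by alignment error and
inspecting the largest rotation component of the worst ones. The following is a minimal
sketch of that selection, assuming per-sample errors and (pitch, yaw, roll) tuples in
degrees are already available; the function names and toy values are hypothetical.

```python
def largest_abs_angle(pose):
    """Largest magnitude among the (pitch, yaw, roll) angles of one sample."""
    return max(abs(angle) for angle in pose)


def hardest_samples(errors, poses, n=50):
    """Return the poses of the n samples with the biggest alignment error."""
    order = sorted(range(len(errors)), key=lambda i: errors[i], reverse=True)
    return [poses[i] for i in order[:n]]


# Toy data: four samples with made-up errors and head poses.
errors = [0.04, 0.21, 0.09, 0.15]
poses = [(2.0, -5.0, 1.0), (10.0, -42.0, 8.0), (-3.0, 12.0, 0.5), (-28.0, 6.0, 15.0)]

hard = hardest_samples(errors, poses, n=2)
angles = [largest_abs_angle(p) for p in hard]
print(angles)  # [42.0, 28.0]
```

A histogram of `angles` over the 50 hardest samples per method corresponds to the right
panel described for Fig. 2.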
Figure 3: ConvNet model for head pose estimation.


4.2 Head Pose Estimation 

Given the training data from 300W with augmented head pose annotation, we train a
convolutional network (ConvNet) [lOI] model for head pose estimation on the 300W training
set of 3148 images. The samples are augmented threefold by applying small perturbations
to the face bounding box. The ConvNet structure is shown in Fig. 3. The input of the
network is a 96x96 gray-scale face image, normalised to the range between 0 and 1. The
feature extraction stage contains three convolutional layers, three pooling layers, two
fully connected layers and three drop-out layers. As we pose head pose estimation as a
regression problem, the output layer is 3x1, representing the head pose pitch, yaw and
roll angles respectively. The angles are normalised to between -1 and 1. We use Nesterov's
Accelerated Gradient Descent (NAG) method [El] for parameter optimisation, with the
momentum set to 0.9 and the learning rate to 0.01. The training finishes in two hours on
a Tesla K40c GPU after around 1300 epochs, controlled by an early-stop strategy. The
learning curve is shown in Fig. 4 (left). The forward propagation of this network on the
GPU takes only 0.3 ms per image on average.
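
To make the optimiser settings concrete, the sketch below applies Nesterov's accelerated
gradient with momentum 0.9 and learning rate 0.01 to a toy squared-error objective. It is
not the network training code: the objective, the target vector (standing in for a
normalised pitch, yaw, roll triple) and the function name `nag_minimise` are assumptions
made purely for illustration.

```python
def nag_minimise(grad, w0, lr=0.01, momentum=0.9, steps=500):
    """Minimise a function with Nesterov's accelerated gradient.

    grad: function returning the gradient (list) at a parameter list.
    The gradient is evaluated at the look-ahead point w + momentum * v.
    """
    w = list(w0)
    v = [0.0] * len(w)
    for _ in range(steps):
        lookahead = [wi + momentum * vi for wi, vi in zip(w, v)]
        g = grad(lookahead)
        v = [momentum * vi - lr * gi for vi, gi in zip(v, g)]
        w = [wi + vi for wi, vi in zip(w, v)]
    return w


# Toy objective: sum of squared differences to a normalised pose target.
target = [0.1, -0.4, 0.05]


def squared_error_grad(w):
    return [2.0 * (wi - ti) for wi, ti in zip(w, target)]


fitted = nag_minimise(squared_error_grad, [0.0, 0.0, 0.0])
print(fitted)
```

On this convex toy problem the iterate converges to `target`; in the actual network the
same update rule would be applied to all weights using mini-batch gradients.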

4.3 Pose based Cascaded Face Alignment 

4.3.1 General Cascaded Face Alignment 

In order to make this work self-contained, we first summarise the general framework of
cascaded face alignment. Face shape is often represented as a vector of landmark
locations, i.e., $S = (x_1, \dots, x_k, \dots, x_K) \in \mathbb{R}^{2K}$, where $K$ is the
number of landmarks and $x_k \in \mathbb{R}^2$ is the 2D coordinates of the $k$-th
landmark. Most current holistic-based methods work in a coarse-to-fine fashion, i.e.,
shape estimation starts from an initial shape $S^0$ and progressively refines the shape by
a cascade of $T$ regressors, $R^1, \dots, R^T$. Each regressor refines the shape by
producing an update, $\Delta S$, which is added to the current shape estimate, that is,

$$S^t = S^{t-1} + \Delta S. \qquad (1)$$

The update $\Delta S$ is returned by the regressor, which takes the previous pose estimate
and the image feature $I$ as inputs:

$$\Delta S = R^t(S^{t-1}, I). \qquad (2)$$

An important aspect that differentiates this framework from the classic boosted approaches
is the feature re-sampling process. More specifically, instead of using fixed features,
the input feature for regressor $R^t$ is calculated relative to the current pose estimate.
This is often called a pose-indexed feature, as in [□]. It introduces weak geometric
invariance into the cascade process and shows good performance in practice. The CPR is
summarised in Algorithm 1 [0].

Algorithm 1 Cascaded Pose Regression

Require: Image $I$, initial pose $S^0$
Ensure: Estimated pose $S^T$

1: for $t = 1$ to $T$ do
2:     $f^t = h^t(I, S^{t-1})$          ▷ Shape-indexed features
3:     $\Delta S = R^t(f^t)$            ▷ Apply regressor
4:     $S^t = S^{t-1} + \Delta S$       ▷ Update pose
5: end for
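
The loop of Algorithm 1 can be sketched in Python as follows. The feature extractors and
regressors are learned components in practice; here they are replaced by hypothetical toy
functions (the "image" simply stores the true shape and each stage corrects half of the
remaining offset) so that the control flow can be run end to end.

```python
def cascaded_pose_regression(image, init_shape, feature_fns, regressors):
    """Run a cascade: features -> regressor -> additive shape update."""
    shape = list(init_shape)
    for extract, regress in zip(feature_fns, regressors):
        features = extract(image, shape)               # shape-indexed features
        delta = regress(features)                      # apply regressor
        shape = [s + d for s, d in zip(shape, delta)]  # update pose
    return shape


# Toy stand-ins (assumptions): features are the offset to the true shape,
# and every stage regresses half of that offset.
def toy_features(image, shape):
    return [t - s for t, s in zip(image["true_shape"], shape)]


def toy_regressor(features):
    return [0.5 * f for f in features]


T = 3
image = {"true_shape": [4.0, 8.0]}
estimate = cascaded_pose_regression(
    image, [0.0, 0.0], [toy_features] * T, [toy_regressor] * T
)
print(estimate)  # [3.5, 7.0]
```

Because every stage only corrects part of the residual, the result depends on where the
cascade starts, which is the initialisation dependence this paper targets.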



4.3.2 Head Pose based Cascaded Face Alignment 

In section 4.2 we have presented how a ConvNet model can be used for head pose estimation. 
We propose two head pose based initialisation schemes for face alignment. One is based on 
an average 3D face shape projection and the other is based on nearest neighbour searching. 

Scheme 1: 3D face shape based initialisation Given a 3D mean face shape, represented
by 68 3D facial landmark locations, as shown in Fig. 1, we first project this shape under
the estimated head pose to a set of canonical 2D locations. More specifically, we use a
constant translation and focal length in order to get a reasonable projection for all
images. Then we re-scale the canonical 2D projection by the face bounding box scale of the
test image to get the initialisation. We can represent the initialisation process by a
function $\mathcal{T}$ as follows:

$$S^0 = \mathcal{T}(\theta, bb, \bar{S}^{3D}) \qquad (3)$$

with $bb$ the face bounding box, $\bar{S}^{3D}$ the 3D mean face shape, and $\theta$ the
estimated head pose, which can be represented by:

$$\theta = \mathcal{G}(I, bb) \qquad (4)$$

where $\mathcal{G}$ is the deep convolutional model described in section 4.2.
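
A minimal sketch of Scheme 1 is given below. Several details are assumptions not stated in
the text: the Euler angle convention (R = Rz(roll) Ry(yaw) Rx(pitch)), the pinhole
projection with `focal` and `depth` set to 1000, and the way the canonical projection is
fitted into the bounding box (centred, with a uniform scale). The five-point toy shape
stands in for the 68-landmark mean shape.

```python
import math


def rotation_matrix(pitch, yaw, roll):
    """R = Rz(roll) * Ry(yaw) * Rx(pitch), angles in degrees (assumed convention)."""
    p, y, r = (math.radians(a) for a in (pitch, yaw, roll))
    cp, sp = math.cos(p), math.sin(p)
    cy, sy = math.cos(y), math.sin(y)
    cr, sr = math.cos(r), math.sin(r)
    return [
        [cr * cy, cr * sy * sp - sr * cp, cr * sy * cp + sr * sp],
        [sr * cy, sr * sy * sp + cr * cp, sr * sy * cp - cr * sp],
        [-sy, cy * sp, cy * cp],
    ]


def init_shape_from_pose(mean_shape_3d, pose, bbox, focal=1000.0, depth=1000.0):
    """Project a 3D mean shape under `pose` and fit it into `bbox`.

    mean_shape_3d: list of (X, Y, Z); pose: (pitch, yaw, roll) in degrees;
    bbox: (x, y, width, height). focal and depth are constant, assumed values.
    """
    R = rotation_matrix(*pose)
    projected = []
    for X, Y, Z in mean_shape_3d:
        xr = R[0][0] * X + R[0][1] * Y + R[0][2] * Z
        yr = R[1][0] * X + R[1][1] * Y + R[1][2] * Z
        zr = R[2][0] * X + R[2][1] * Y + R[2][2] * Z + depth
        projected.append((focal * xr / zr, focal * yr / zr))  # pinhole projection

    xs = [u for u, _ in projected]
    ys = [v for _, v in projected]
    span_x = (max(xs) - min(xs)) or 1.0
    span_y = (max(ys) - min(ys)) or 1.0
    mid_x = (max(xs) + min(xs)) / 2.0
    mid_y = (max(ys) + min(ys)) / 2.0

    bx, by, bw, bh = bbox
    scale = min(bw / span_x, bh / span_y)  # uniform scale keeps the aspect ratio
    return [(bx + bw / 2.0 + (u - mid_x) * scale,
             by + bh / 2.0 + (v - mid_y) * scale) for u, v in projected]


# Toy mean shape: four face corners and a nose tip offset along the Z axis.
toy_shape = [(-1.0, -1.0, 0.0), (1.0, -1.0, 0.0), (-1.0, 1.0, 0.0),
             (1.0, 1.0, 0.0), (0.0, 0.0, 0.5)]
box = (100.0, 50.0, 200.0, 200.0)
frontal = init_shape_from_pose(toy_shape, (0.0, 0.0, 0.0), box)
turned = init_shape_from_pose(toy_shape, (0.0, 30.0, 0.0), box)
print(frontal[4], turned[4])
```

In the frontal case the nose tip lands at the centre of the box; with a 30 degree yaw it
shifts horizontally, so the initial shape already reflects the estimated pose.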

Scheme 2: Nearest Neighbour based initialisation We propose a second scheme for head
pose based initialisation by nearest neighbour search. Since we have also provided the
training samples with head pose information, we can easily search for training samples
whose head pose is similar to that of a test sample. Then we calculate the similarity
transformation between the two face bounding boxes in order to calculate the
initialisation shape for the test sample. In this way, we can also provide k
initialisations by searching for the k nearest neighbours in the training set.
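
Scheme 2 can be sketched as a nearest neighbour lookup in head pose space followed by a
box-to-box transfer of the retrieved shape. The Euclidean pose distance, the dictionary
layout of the training samples, and the specific similarity transform (centre alignment
plus a uniform scale from the area ratio) are assumptions for this illustration.

```python
import math


def pose_distance(a, b):
    """Euclidean distance between two (pitch, yaw, roll) tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def transfer_shape(shape, src_bbox, dst_bbox):
    """Map landmarks from src_bbox to dst_bbox with translation and uniform scale."""
    sx, sy, sw, sh = src_bbox
    dx, dy, dw, dh = dst_bbox
    scale = math.sqrt((dw * dh) / (sw * sh))
    src_cx, src_cy = sx + sw / 2.0, sy + sh / 2.0
    dst_cx, dst_cy = dx + dw / 2.0, dy + dh / 2.0
    return [(dst_cx + (x - src_cx) * scale, dst_cy + (y - src_cy) * scale)
            for x, y in shape]


def nn_initialisations(test_pose, test_bbox, train_set, k=1):
    """Return k initial shapes from the training samples closest in head pose."""
    ranked = sorted(train_set, key=lambda s: pose_distance(s["pose"], test_pose))
    return [transfer_shape(s["shape"], s["bbox"], test_bbox) for s in ranked[:k]]


# Toy training set with two samples (two landmarks each).
train_set = [
    {"pose": (0.0, 0.0, 0.0), "bbox": (0.0, 0.0, 100.0, 100.0),
     "shape": [(25.0, 25.0), (75.0, 75.0)]},
    {"pose": (5.0, 40.0, 2.0), "bbox": (10.0, 10.0, 50.0, 50.0),
     "shape": [(20.0, 20.0), (50.0, 40.0)]},
]
inits = nn_initialisations((3.0, 35.0, 0.0), (100.0, 100.0, 100.0, 100.0), train_set, k=1)
print(inits[0])  # [(120.0, 120.0), (180.0, 160.0)]
```

Setting `k` larger than 1 yields several pose-consistent initialisations whose outputs
can then be combined as in the multi-initialisation baseline.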

Once we get a reliable initialisation (or several), we feed it to Algorithm 1 and
apply the cascade of regressors in the same way as the baseline approach. In the case of
multiple initialisations, we calculate the output in a similar fashion to [B, 123], i.e.,
we pick the median value of their estimates. We build our proposed head pose based
initialisation schemes on top of the Cascaded Pose Regression (CPR) method due to
its simplicity and popularity. We train its recent variant, the Robust Cascaded Pose
Regression (RCPR) [□] model, using its new interpolated feature extraction, as
re-implemented by the author of [E3]. We do not use its full version, as occlusion status
annotation is not available. We trained the baseline RCPR model on our 300W training set
using Viola-Jones [123] face detection. 20 random initialisations are used for data
augmentation at training time.

Figure 4: Head pose estimation result. Left, learning curve of head pose network, with y
axis the Root Mean Square Error (RMSE) and x axis the number of epochs; middle, absolute
mean error on test set; right, example results of head pose estimation.


5 Evaluation 

5.1 Head Pose Estimation 

We first evaluate the performance of head pose estimation. As discussed before, it is very
difficult to get the ground truth head pose for face images acquired in uncontrolled
conditions, so we calculate the reference pose from the annotated facial landmark
locations. We apply the trained deep ConvNet model to the test images of 300W and measure
the performance. The result is shown in Fig. 4. The absolute mean errors of the head pose
pitch, yaw and roll angles are 5.1°, 4.2° and 2.4°, respectively. Some example results are
shown on the right. Although the work by Zhu & Ramanan [E3] is conceptually similar to
ours in terms of simultaneous head pose and facial landmark estimation, we do not compare
to it here because their work can only estimate very sparse head pose yaw angles
(e.g. -15°, 0°, 15°).
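
The absolute mean error reported above is computed per angle over the test set. A minimal
sketch, assuming predictions and references are given as (pitch, yaw, roll) tuples in
degrees; the function name and the toy values are hypothetical.

```python
def mean_absolute_errors(predicted, reference):
    """Per-angle mean absolute error over a list of (pitch, yaw, roll) tuples."""
    n = len(predicted)
    if n == 0 or n != len(reference):
        raise ValueError("inputs must be non-empty and of equal length")
    sums = [0.0, 0.0, 0.0]
    for p, g in zip(predicted, reference):
        for i in range(3):
            sums[i] += abs(p[i] - g[i])
    return tuple(s / n for s in sums)


pred = [(10.0, -20.0, 3.0), (0.0, 15.0, -2.0)]
ref = [(6.0, -25.0, 2.0), (2.0, 12.0, -1.0)]
mae = mean_absolute_errors(pred, ref)
print(mae)  # (3.0, 4.0, 1.0)
```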

5.2 Face Alignment 

We first show the effectiveness of head pose based initialisation by comparing with the
baseline strategy of the CPR framework [□, E3], i.e., generating random initialisations
from training samples. The comparison is shown in Fig. 5. As can be seen in the left
figure, by using one initialisation projected from the 3D face shape, we obtain similar
performance to the baseline approach with 5 initialisation shapes, and much better
performance than the baseline with only one random initialisation shape. Similarly
superior performance is obtained by using the nearest neighbour initialisation scheme, as
shown on the right. By using more head pose based initialisations, we gain even better
results, though the improvement is minor. It is worth noting that by using our proposed
initialisation scheme, we are able to decrease the number of failure cases (sample-wise
average alignment error > 0.1) from 130 to 69 (scheme 1) and from 130 to 72 (scheme 2),
a reduction of nearly 50%. Those samples usually have large head pose variations and are
difficult for conventional face alignment methods. Moreover, when using one
initialisation, the whole test procedure on one typical image takes 3.8 ms (0.3 ms for
head pose estimation and 3.5 ms for cascaded face alignment).
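
The failure statistics above follow from counting samples whose error exceeds 0.1 and
comparing the counts. The short sketch below reproduces the relative reductions from the
reported counts; the helper names are hypothetical, and the per-sample errors are assumed
to be already normalised as in the evaluation.

```python
def failure_count(errors, threshold=0.1):
    """Number of samples whose average alignment error exceeds the threshold."""
    return sum(1 for e in errors if e > threshold)


def relative_reduction(before, after):
    """Fraction of failures removed when going from `before` to `after`."""
    return (before - after) / before


reduction_scheme1 = relative_reduction(130, 69)
reduction_scheme2 = relative_reduction(130, 72)
print(round(100 * reduction_scheme1, 1))  # 46.9
print(round(100 * reduction_scheme2, 1))  # 44.6
print(failure_count([0.05, 0.12, 0.30, 0.10]))  # 2
```

Both schemes therefore remove a little under half of the baseline failures, consistent
with the "nearly 50%" figure quoted in the text.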

We further compare the proposed method with recent state of the art methods, including
5 holistic based methods (SDM [ED], IFA [□], LBF [ED], CFAN [EE], TCDCN [E3]) and 3
local based methods (GNDPM [123], DRMF [ffl], CCNF [□]). SDM and DRMF are trained
on the Multi-PIE [O] dataset and detect 49 and 66 facial landmarks respectively. The
remaining models are trained on the 300W dataset. When we run their models on the test
images, we use the best bounding boxes for a fair comparison. The best bounding box refers
to Viola-Jones detection for SDM and RCPR, and to the tight face detection provided by the
300W dataset for the rest. The comparison is shown in Fig. 6. As can be seen, our proposed
method shows competitive performance. We also compare the performance on another common
type of face detection, HeadHunter, given its strong performance in face detection. The
result is shown on the right of Fig. 6. We observe that the performance of most methods
deteriorates significantly when testing on HeadHunter face bounding boxes. Our method
provides the most stable result, even though the HeadHunter face bounding box overlaps
more with the face detection from 300W (both are tight boxes around the facial landmarks)
than with the Viola-Jones face detection. We believe this robustness to face bounding box
changes is partially due to our head pose based initialisation strategy.

Figure 5: Our proposed head pose based initialisation scheme vs. random initialisation
scheme. Left, our 3D face shape based scheme; right, our Nearest Neighbour (NN) based
scheme.

Figure 6: Comparison with recent methods. Left, results from the best face detection of
each method; right, results from the common HeadHunter face detection. Pose-RCPR is our
proposed method using only 1 initialisation from 3D.


6 Conclusion and Future Work

In this paper we first demonstrated that most recent face alignment methods fail when
large head pose variation is present. Based on the fact that cascaded face alignment
is initialisation dependent, we proposed supervised initialisation schemes based on
explicit head pose estimation. We use deep convolutional networks for head pose estimation
and produce the initialisation shape either by projecting a 3D face shape to the test
image or by searching for nearest neighbour shapes in the training set. We demonstrated
that using a more reliable initialisation improves the face alignment performance,
reducing the number of failures by nearly 50%. The method also shows comparable or better
performance when compared to recent face alignment approaches.

Although we have managed to decrease the failure cases to a certain degree, we have
not fully solved this problem. There are several interesting directions for future
research. First, using head pose based initialisation shapes in the training stage may
further boost the performance. Second, we have only tested our method on RCPR; we believe
the proposed scheme can be naturally applied to other cascaded face alignment methods.
Our results also raise several interesting questions. Do we need to make the cascaded
learning model better for face alignment, or to make the initialisation more reliable? Do
we need more uniformly distributed data or a better model in order to make face alignment
work better over a wider range of head pose variations? We will investigate these problems
in our future research.